University Assignment: Explore the use of word embeddings and sentiment analysis techniques in natural language processing (NLP). Use a dataset of movie reviews to create a word embedding model using Word2Vec and evaluate the performance of the model in sentiment analysis tasks.
Dataset: IMDb Movie Reviews The dataset consists of movie reviews from the IMDb website, along with their corresponding sentiment labels (positive or negative). The dataset is divided into a training set and a test set, with 25,000 reviews in each set.
import warnings
warnings.filterwarnings('ignore')
#read dataset
import pandas as pd
df = pd.read_csv('IMDB Dataset.csv')
df.head()
| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
df.shape
(50000, 2)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
print(
    "This dataset has two variables: review and sentiment.\n"
    "review contains the text of a movie review posted on IMDb.\n"
    "sentiment is the positive or negative label that the review expresses."
)
This dataset has two variables: review and sentiment. review contains the text of a movie review posted on IMDb. sentiment is the positive or negative label that the review expresses.
#!pip install nltk
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))
[nltk_data] Downloading package stopwords to /Users/kiko/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/kiko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
# Split the dataframe into training and testing sets
X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25)
train_df = X_train.to_frame()
test_df = X_test.to_frame()
train_df.head()
| | review |
|---|---|
| 11681 | Oh just what I needed,another movie about 19th... |
| 24009 | I saw this only because my 10-yr-old was bored... |
| 40502 | The show itself basically reflects the typical... |
| 755 | A well-made run-of-the-mill movie with a tragi... |
| 26143 | I just bought this movie yesterday night, and ... |
test_df.head()
| | review |
|---|---|
| 39550 | I went into this film expecting it to be simil... |
| 11244 | Funny that I find myself forced to review this... |
| 40728 | This film is really really bad, it is not very... |
| 40580 | Now, I haven't read the original short story t... |
| 46371 | I suppose I should be fair and point out that ... |
def clean_text(df):
    # lowercase and strip whitespace on either side
    df['clean_col'] = df['review'].apply(lambda x: x.lower().strip())
    # replace anything that is not a letter (punctuation, digits, HTML remnants) with a space;
    # this also covers digit removal, so no separate digit step is needed
    df['clean_col'] = df['clean_col'].apply(lambda x: re.sub('[^a-z]', ' ', x))
    # collapse the extra spaces introduced by the substitution
    df['clean_col'] = df['clean_col'].apply(lambda x: re.sub(' +', ' ', x).strip())
    # remove stopwords
    df['clean_col'] = df['clean_col'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
    # tokenize
    df['clean_col'] = df['clean_col'].apply(word_tokenize)
    return df
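To see what this pipeline does to a single review, here is a self-contained miniature of the same steps (the hard-coded stopword set is an assumption standing in for NLTK's list, so no downloads are needed):

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this demo)
stop_words = {"i", "this", "was", "a", "the", "to"}

def clean_one(text):
    # same steps as clean_text above: lowercase, strip non-letters, drop stopwords
    text = text.lower().strip()
    text = re.sub("[^a-z]", " ", text)
    return [w for w in text.split() if w not in stop_words]

tokens = clean_one("I thought this was a WONDERFUL way to spend time...")
# -> ['thought', 'wonderful', 'way', 'spend', 'time']
```

Because the non-letter substitution also eats digits, a word like "19th" survives only as the fragment "th", which is exactly what shows up in the cleaned tokens below.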
#clean train and test dataset
clean_text(train_df)
| | review | clean_col |
|---|---|---|
| 11681 | Oh just what I needed,another movie about 19th... | [oh, needed, another, movie, th, century, engl... |
| 24009 | I saw this only because my 10-yr-old was bored... | [saw, yr, old, bored, friend, hated, course, l... |
| 40502 | The show itself basically reflects the typical... | [show, basically, reflects, typical, nature, a... |
| 755 | A well-made run-of-the-mill movie with a tragi... | [well, made, run, mill, movie, tragic, ending,... |
| 26143 | I just bought this movie yesterday night, and ... | [bought, movie, yesterday, night, love, everyo... |
| ... | ... | ... |
| 31240 | (Some spoilers included:)<br /><br />Although,... | [spoilers, included, br, br, although, many, c... |
| 40664 | This movie had very few moments of real drama.... | [movie, moments, real, drama, opening, minutes... |
| 39078 | The third film in a cycle of incomparably bril... | [third, film, cycle, incomparably, brilliant, ... |
| 49881 | Definitely an odd debut for Michael Madsen. Ma... | [definitely, odd, debut, michael, madsen, mads... |
| 8261 | Actually my vote is a 7.5. Anyway, the movie w... | [actually, vote, anyway, movie, good, funny, p... |
37500 rows × 2 columns
clean_text(test_df)
| | review | clean_col |
|---|---|---|
| 39550 | I went into this film expecting it to be simil... | [went, film, expecting, similar, matrix, pi, b... |
| 11244 | Funny that I find myself forced to review this... | [funny, find, forced, review, movie, br, br, r... |
| 40728 | This film is really really bad, it is not very... | [film, really, really, bad, well, done, lack, ... |
| 40580 | Now, I haven't read the original short story t... | [read, original, short, story, know, literary,... |
| 46371 | I suppose I should be fair and point out that ... | [suppose, fair, point, believe, ghosts, said, ... |
| ... | ... | ... |
| 14015 | This is one of the best Fred Astaire-Ginger Ro... | [one, best, fred, astaire, ginger, rogers, fil... |
| 15507 | Police Story is one of Jackie Chan's classic f... | [police, story, one, jackie, chan, classic, fi... |
| 31089 | This is not a good movie but I still like it. ... | [good, movie, still, like, cat, clovis, gold, ... |
| 26840 | Red Rock West is one of those rare films that ... | [red, rock, west, one, rare, films, keeps, gue... |
| 14029 | After seeing this routine by John Leguizamo, I... | [seeing, routine, john, leguizamo, finally, re... |
12500 rows × 2 columns
Preprocess the data and explore its characteristics.
Data Cleaning
Load the training set and test set into two separate dataframes. Clean the text data by removing punctuation, digits, and stop words. Tokenize the text data and convert it to lowercase.
Data Exploration
Compute and plot the distribution of the length of the reviews (in terms of the number of words). Compute the frequency of the top-k most common words in the training set.
# Code snippet for data exploration
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
def count_words(df):
    # count the number of tokens in each cleaned review
    df['review_length'] = df['clean_col'].apply(len)
# Distribution of review lengths
# This plot should have the length of reviews along the x axis and the frequency of that length along the y axis
# Plot for the train dataset
count_words(train_df)
sns.histplot(train_df['review_length'], bins=50)  # distplot is removed in recent seaborn
<AxesSubplot:xlabel='review_length'>
#plot review length distribution for test dataset
count_words(test_df)
sns.histplot(test_df['review_length'], bins=50)
<AxesSubplot:xlabel='review_length'>
# Frequency of top-k most common words
#I am using top-150 words
from collections import Counter
lst = train_df['clean_col'].explode().to_list()
Counter = Counter(lst).most_common(150)
Counter[:10]
[('br', 151788),
('movie', 66139),
('film', 59605),
('one', 40288),
('like', 30140),
('good', 22465),
('time', 18861),
('even', 18592),
('would', 18433),
('story', 17442)]
# create a list of the top 150 words (dropping the counts)
top_150_word = [word for word, count in Counter]
Create a word embedding model using Word2Vec.
Word2Vec Training
Train a Word2Vec model on the cleaned text data in the training set. Save the trained model to a file for later use.
Word Embedding Visualization
Visualize the embeddings of the top-k most common words in the training set using t-SNE. Visualize the embeddings of the words "good", "bad", "great", and "terrible" using t-SNE.
# Use Word2Vec to create and save your word2vec model
import gensim
from gensim.models import Word2Vec
text = train_df.clean_col.tolist()
# Create Word2Vec model for training set
model = Word2Vec(sentences=text)
# Code snippet for embedding visualization
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np
# Visualize embeddings of top-150 words. Create a 2d embedding array using TSNE and use that for your visualisation
num_components = 2
model_vector_lst = []
for i in top_150_word:
model_vector_lst.append(model.wv[i])
vectors = np.asarray(model_vector_lst)
labels = np.asarray(top_150_word)
# apply TSNE
tsne = TSNE(n_components=num_components, random_state=0)
vectors = tsne.fit_transform(vectors)
x_vals = [v[0] for v in vectors]
y_vals = [v[1] for v in vectors]
def plot_embeddings(x_vals, y_vals, labels):
    import plotly.graph_objs as go
    fig = go.Figure()
    trace = go.Scatter(x=x_vals, y=y_vals, mode='markers', text=labels)
    fig.add_trace(trace)
    fig.update_layout(title="Word2Vec embeddings visualised with t-SNE")
    fig.show()
    return fig
plot = plot_embeddings(x_vals, y_vals, labels)
#This is an interactive plot. If you hover on the points, you can see the word/label
# Visualize embeddings of specific words. Create a 2d embedding array using TSNE and use that for your visualisation
words_to_visualize = ["good", "bad", "great", "terrible"]
num_components = 2
model_vector_lst = []
for i in words_to_visualize:
model_vector_lst.append(model.wv[i])
vectors = np.asarray(model_vector_lst)
labels = np.asarray(words_to_visualize)
# apply TSNE
tsne = TSNE(n_components=num_components, random_state=0, perplexity=1)
vectors = tsne.fit_transform(vectors)
x_vals = [v[0] for v in vectors]
y_vals = [v[1] for v in vectors]
plot = plot_embeddings(x_vals, y_vals, labels)
#This is an interactive plot. If you hover on the points, you can see the word/label.
Use the word embeddings to perform sentiment analysis on the test set and evaluate the performance of the model.
Sentiment Analysis
Convert the cleaned text data in the test set to vectors using the trained Word2Vec model. Train a logistic regression model on the vector representations of the text data in the training set. Use the trained logistic regression model to predict the sentiment labels (positive or negative) of the text data in the test set.
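The conversion step described above amounts to combining the Word2Vec vectors of each review's words into one fixed-length vector. A minimal sketch of the idea, with a plain dict standing in for `model.wv` (an assumption for this demo; note the notebook code below sums the vectors rather than averaging, and both are common choices):

```python
import numpy as np

# Toy stand-in for model.wv: word -> 4-dimensional vector
wv = {"good": np.ones(4), "movie": np.full(4, 3.0)}

def review_vector(tokens, wv, dim=4):
    # average the vectors of the words the model knows; zeros if none are known
    known = [wv[t] for t in tokens if t in wv]
    return np.mean(known, axis=0) if known else np.zeros(dim)

vec = review_vector(["good", "movie", "unseen"], wv)
# -> array([2., 2., 2., 2.])
```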
Evaluation
Calculate the accuracy, precision, recall, and F1 score of the sentiment analysis model. Visualize the confusion matrix of the sentiment analysis model.
# To make the analysis faster, I am going to convert the whole dataset into vectors first and then
# perform the train/test split. I will be using the same random state as before (random_state=104), so
# the split will be identical.
df.head()
| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
clean_text(df)
| | review | sentiment | clean_col |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | [one, reviewers, mentioned, watching, oz, epis... |
| 1 | A wonderful little production. <br /><br />The... | positive | [wonderful, little, production, br, br, filmin... |
| 2 | I thought this was a wonderful way to spend ti... | positive | [thought, wonderful, way, spend, time, hot, su... |
| 3 | Basically there's a family where a little boy ... | negative | [basically, family, little, boy, jake, thinks,... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | [petter, mattei, love, time, money, visually, ... |
| ... | ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | positive | [thought, movie, right, good, job, creative, o... |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | negative | [bad, plot, bad, dialogue, bad, acting, idioti... |
| 49997 | I am a Catholic taught in parochial elementary... | negative | [catholic, taught, parochial, elementary, scho... |
| 49998 | I'm going to have to disagree with the previou... | negative | [going, disagree, previous, comment, side, mal... |
| 49999 | No one expects the Star Trek movies to be high... | negative | [one, expects, star, trek, movies, high, art, ... |
50000 rows × 3 columns
df['clean_col2'] = [' '.join(map(str, l)) for l in df['clean_col']]
df.head()
| | review | sentiment | clean_col | clean_col2 |
|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | [one, reviewers, mentioned, watching, oz, epis... | one reviewers mentioned watching oz episode ho... |
| 1 | A wonderful little production. <br /><br />The... | positive | [wonderful, little, production, br, br, filmin... | wonderful little production br br filming tech... |
| 2 | I thought this was a wonderful way to spend ti... | positive | [thought, wonderful, way, spend, time, hot, su... | thought wonderful way spend time hot summer we... |
| 3 | Basically there's a family where a little boy ... | negative | [basically, family, little, boy, jake, thinks,... | basically family little boy jake thinks zombie... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | [petter, mattei, love, time, money, visually, ... | petter mattei love time money visually stunnin... |
# Count vectorization of text
from sklearn.feature_extraction.text import CountVectorizer
corpus = df['clean_col2'].values
# Creating the vectorizer
vectorizer = CountVectorizer(stop_words='english')
# Converting the text to numeric data
X = vectorizer.fit_transform(corpus)
# Preparing a document-term DataFrame for machine learning;
# its column names give the vocabulary of the corpus
CountVectorizedData = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# Creating the list of words which are present in the Document term matrix
WordsVocab=CountVectorizedData.columns
# Printing sample words
WordsVocab[0:10]
Index(['aa', 'aaa', 'aaaaaaaaaaaahhhhhhhhhhhhhh', 'aaaaaaaargh', 'aaaaaaah',
'aaaaaaahhhhhhggg', 'aaaaagh', 'aaaaah', 'aaaaahhhh', 'aaaaargh'],
dtype='object')
len(WordsVocab)
99056
# The document-term matrix was already computed above, so reuse it
# instead of re-transforming the text
CountVecData = CountVectorizedData
W2Vec_Data = pd.DataFrame()
# Build one 100-dimensional vector per review
for i in range(CountVecData.shape[0]):
    Sentence = np.zeros(100)
    # Looping through each word of the review and, if it is present in
    # the Word2Vec vocabulary, adding its vector to the running sum
    for word in WordsVocab[CountVecData.iloc[i, :] >= 1]:
        if word in model.wv.key_to_index:
            Sentence = Sentence + model.wv[word]
    # Appending the sentence vector (DataFrame.append was removed in pandas 2.0)
    W2Vec_Data = pd.concat([W2Vec_Data, pd.DataFrame([Sentence])])
W2Vec_Data
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.845066 | 21.074142 | -27.039258 | 9.002568 | -18.449693 | -19.489673 | 24.202210 | 13.770300 | -63.909698 | -30.462318 | ... | 43.626524 | 50.205337 | -11.977495 | -2.651355 | 63.870902 | 33.575529 | -6.873737 | -4.653884 | 33.407272 | -3.028877 |
| 0 | 4.646513 | -27.095638 | -11.011114 | -15.893251 | -7.019846 | -9.543718 | 17.137718 | 32.488901 | -29.658915 | -11.607785 | ... | 18.276616 | 23.801546 | 11.551250 | 22.952833 | 43.372218 | 34.946268 | 3.623533 | -17.111816 | 29.811776 | -2.583454 |
| 0 | 27.581128 | -4.100506 | -15.049071 | 1.995729 | -18.761900 | -7.315710 | 37.401362 | 15.316645 | -20.697380 | -5.300695 | ... | 24.438132 | 35.133365 | -9.264058 | 15.200469 | 46.843492 | 23.440720 | 1.250978 | -10.280586 | 19.974900 | 13.492671 |
| 0 | 4.278052 | 5.576408 | 4.444432 | -4.695909 | -16.734504 | -0.619649 | 22.183735 | 6.021529 | -20.397884 | -14.555054 | ... | 26.877040 | 18.771703 | -4.897337 | 6.664134 | 30.610391 | 19.250141 | 4.208647 | -3.795059 | 12.350482 | 15.395156 |
| 0 | 2.725437 | -20.192761 | -15.919149 | 0.168059 | -8.541662 | -26.854808 | 26.608201 | 8.664394 | -18.287214 | 6.231360 | ... | 27.186128 | 43.786424 | 5.449669 | 19.755477 | 72.035188 | 26.304189 | 5.238519 | 1.304170 | 6.507701 | 21.170385 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 0 | 10.808050 | -14.689696 | -9.491751 | 12.295364 | -9.635763 | 0.565626 | 43.320826 | -2.752151 | -22.291317 | -10.542328 | ... | -3.435190 | 27.442312 | -15.341893 | 35.154118 | 58.849406 | 30.870952 | 11.762682 | -21.430055 | 14.595973 | 22.106617 |
| 0 | 17.146678 | 3.327781 | 0.254207 | -4.129948 | -11.971751 | -5.713457 | 27.006311 | 15.177977 | -8.910854 | -24.821422 | ... | 24.132641 | 15.240886 | -12.247834 | 10.002461 | 30.757395 | 18.018837 | 6.508978 | -10.108581 | 9.777080 | 5.976269 |
| 0 | 2.415512 | -0.835904 | -2.650888 | -4.701153 | -6.430142 | -20.942541 | 15.420227 | -1.658896 | -23.537929 | -1.566711 | ... | 44.268204 | 19.043730 | 11.429349 | 11.499594 | 39.452780 | 17.482174 | -28.298990 | 2.480149 | 10.672934 | 8.731389 |
| 0 | 2.747474 | 6.171118 | 1.112332 | -0.504889 | 0.562939 | -14.652804 | 20.821587 | 35.256244 | -20.524770 | -6.837540 | ... | 35.610543 | 21.499424 | -2.286395 | 6.553049 | 51.991320 | 19.468075 | 11.514976 | -11.060579 | 12.789450 | 5.820660 |
| 0 | 24.450603 | -12.436791 | -2.608275 | 10.086381 | -17.740947 | -8.711633 | 32.252910 | 9.720001 | -17.300261 | -23.208122 | ... | -1.322157 | 21.698793 | -11.248874 | -0.543811 | 42.189506 | 13.664176 | 9.015829 | -11.308207 | 4.205468 | 4.607011 |
50000 rows × 100 columns
W2Vec_Data.shape
(50000, 100)
#ML on training set
# Adding the target variable
W2Vec_Data.reset_index(inplace=True, drop=True)
W2Vec_Data['sentiment']=df['sentiment']
# Assigning to DataForML variable
DataForML=W2Vec_Data
DataForML.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.845066 | 21.074142 | -27.039258 | 9.002568 | -18.449693 | -19.489673 | 24.202210 | 13.770300 | -63.909698 | -30.462318 | ... | 50.205337 | -11.977495 | -2.651355 | 63.870902 | 33.575529 | -6.873737 | -4.653884 | 33.407272 | -3.028877 | positive |
| 1 | 4.646513 | -27.095638 | -11.011114 | -15.893251 | -7.019846 | -9.543718 | 17.137718 | 32.488901 | -29.658915 | -11.607785 | ... | 23.801546 | 11.551250 | 22.952833 | 43.372218 | 34.946268 | 3.623533 | -17.111816 | 29.811776 | -2.583454 | positive |
| 2 | 27.581128 | -4.100506 | -15.049071 | 1.995729 | -18.761900 | -7.315710 | 37.401362 | 15.316645 | -20.697380 | -5.300695 | ... | 35.133365 | -9.264058 | 15.200469 | 46.843492 | 23.440720 | 1.250978 | -10.280586 | 19.974900 | 13.492671 | positive |
| 3 | 4.278052 | 5.576408 | 4.444432 | -4.695909 | -16.734504 | -0.619649 | 22.183735 | 6.021529 | -20.397884 | -14.555054 | ... | 18.771703 | -4.897337 | 6.664134 | 30.610391 | 19.250141 | 4.208647 | -3.795059 | 12.350482 | 15.395156 | negative |
| 4 | 2.725437 | -20.192761 | -15.919149 | 0.168059 | -8.541662 | -26.854808 | 26.608201 | 8.664394 | -18.287214 | 6.231360 | ... | 43.786424 | 5.449669 | 19.755477 | 72.035188 | 26.304189 | 5.238519 | 1.304170 | 6.507701 | 21.170385 | positive |
5 rows × 101 columns
# Changing the positive to 1, and negative to 0
DataForML['sentiment'] = DataForML['sentiment'].map({'negative':0,'positive':1})
DataForML.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.845066 | 21.074142 | -27.039258 | 9.002568 | -18.449693 | -19.489673 | 24.202210 | 13.770300 | -63.909698 | -30.462318 | ... | 50.205337 | -11.977495 | -2.651355 | 63.870902 | 33.575529 | -6.873737 | -4.653884 | 33.407272 | -3.028877 | 1 |
| 1 | 4.646513 | -27.095638 | -11.011114 | -15.893251 | -7.019846 | -9.543718 | 17.137718 | 32.488901 | -29.658915 | -11.607785 | ... | 23.801546 | 11.551250 | 22.952833 | 43.372218 | 34.946268 | 3.623533 | -17.111816 | 29.811776 | -2.583454 | 1 |
| 2 | 27.581128 | -4.100506 | -15.049071 | 1.995729 | -18.761900 | -7.315710 | 37.401362 | 15.316645 | -20.697380 | -5.300695 | ... | 35.133365 | -9.264058 | 15.200469 | 46.843492 | 23.440720 | 1.250978 | -10.280586 | 19.974900 | 13.492671 | 1 |
| 3 | 4.278052 | 5.576408 | 4.444432 | -4.695909 | -16.734504 | -0.619649 | 22.183735 | 6.021529 | -20.397884 | -14.555054 | ... | 18.771703 | -4.897337 | 6.664134 | 30.610391 | 19.250141 | 4.208647 | -3.795059 | 12.350482 | 15.395156 | 0 |
| 4 | 2.725437 | -20.192761 | -15.919149 | 0.168059 | -8.541662 | -26.854808 | 26.608201 | 8.664394 | -18.287214 | 6.231360 | ... | 43.786424 | 5.449669 | 19.755477 | 72.035188 | 26.304189 | 5.238519 | 1.304170 | 6.507701 | 21.170385 | 1 |
5 rows × 101 columns
# Split the dataframe into training and testing sets using the same random state as before
X = DataForML.drop('sentiment', axis=1)
y = DataForML['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=104, test_size=0.25)
X_train.head() #we have the same split as before
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11681 | 4.570087 | 11.356616 | 1.378616 | 11.755995 | -7.752678 | -6.863883 | 26.874491 | 16.368971 | -4.599308 | 8.510776 | ... | 17.531065 | 29.965890 | 0.394637 | 14.854334 | 36.792583 | 14.869995 | 14.975973 | 1.061858 | 5.724960 | 15.369798 |
| 24009 | 21.086119 | -10.296923 | -3.712598 | 1.085078 | -11.816567 | -9.119838 | 40.764881 | -0.447825 | -17.496902 | -14.338426 | ... | 18.316860 | 15.392579 | -0.441269 | 7.672108 | 34.320870 | 30.942831 | 7.530459 | -8.615611 | 4.526077 | 6.234121 |
| 40502 | -4.762835 | -7.835522 | -3.633352 | -0.749317 | -2.904131 | -3.663638 | 15.119360 | 5.690972 | -17.137478 | -11.269718 | ... | 27.155459 | 29.246034 | -0.548935 | 5.717524 | 36.694034 | 15.558508 | 2.011929 | 2.138965 | 4.805004 | 0.575274 |
| 755 | 0.975955 | -8.224445 | -4.920710 | -3.016067 | -15.312504 | -4.898792 | 7.329217 | 6.384194 | -26.553284 | -13.656131 | ... | 16.070822 | 19.251972 | 2.319530 | -1.982912 | 25.151508 | 19.765591 | -7.650933 | -0.136045 | 13.125170 | 17.002416 |
| 26143 | 2.421924 | -15.065130 | -11.060019 | 4.496135 | -17.723280 | 8.446882 | 16.907165 | 10.839111 | -16.144197 | -3.672041 | ... | 26.475772 | 17.296033 | -4.513053 | 18.116871 | 39.706810 | 36.393804 | 0.474464 | -4.611091 | 11.770387 | 7.825103 |
5 rows × 100 columns
X_test.head() #we have the same split as before
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39550 | 12.459414 | 3.956161 | 2.986157 | -11.152775 | -21.326900 | -11.577348 | 36.277918 | 5.847686 | -37.068659 | -29.201929 | ... | 30.586074 | 45.044685 | -8.056366 | 19.965253 | 73.038487 | 33.088841 | 2.399853 | -13.107485 | 38.969429 | 15.764981 |
| 11244 | 45.670047 | -4.157246 | -26.522301 | -13.792800 | 8.029035 | -41.404722 | 62.447519 | 61.581277 | -57.570185 | -38.053936 | ... | 74.407756 | 88.492903 | 5.269985 | 18.665334 | 166.756333 | 58.480183 | 33.906816 | -48.480392 | 67.569411 | 16.121584 |
| 40728 | 9.379913 | -11.177137 | -0.704772 | 0.174617 | -9.732054 | -14.047008 | 20.401321 | 0.822430 | -17.634648 | -26.720074 | ... | 15.817401 | 22.173254 | 13.011449 | 6.789313 | 39.143032 | 22.923904 | 3.547468 | -1.901817 | 12.155588 | 5.649701 |
| 40580 | 22.969059 | 7.293580 | 2.123036 | 5.554118 | -36.678775 | -18.622362 | 49.615374 | -3.282337 | -33.056823 | -25.475983 | ... | 28.810939 | 41.458976 | 7.458327 | 26.327205 | 67.371688 | 37.125004 | 1.675617 | -17.003198 | 26.606550 | 18.196014 |
| 46371 | 14.967359 | -6.518351 | 8.341882 | -7.194701 | -12.743630 | -34.869573 | 52.300401 | 1.341762 | -42.710234 | -19.837030 | ... | 35.679409 | 43.943890 | 1.770937 | 22.140901 | 67.812764 | 23.423750 | -4.646741 | -1.587722 | 12.092082 | 13.814334 |
5 rows × 100 columns
# Code snippet for sentiment analysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
clf = LogisticRegression(C=10, penalty='l2', solver='newton-cg')
LOG = clf.fit(X_train, y_train)
# Generating predictions on testing data
prediction=LOG.predict(X_test)
# Measuring accuracy on Testing Data
# Shows precision, recall, F1 score, support, accuracy and lastly, confusion matrix
from sklearn import metrics
print(metrics.classification_report(y_test, prediction))
# Note: confusion_matrix expects (y_true, y_pred) in that order
print(metrics.confusion_matrix(y_test, prediction))
              precision    recall  f1-score   support

           0       0.86      0.86      0.86      6233
           1       0.86      0.86      0.86      6267

    accuracy                           0.86     12500
   macro avg       0.86      0.86      0.86     12500
weighted avg       0.86      0.86      0.86     12500

[[5340  893]
 [ 861 5406]]
## Printing the overall weighted F1 score of the model
F1_Score = metrics.f1_score(y_test, prediction, average='weighted')
print('Weighted F1 score of the model on the test data:', round(F1_Score, 2))
Weighted F1 score of the model on the test data: 0.86
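The confusion matrix above is only printed as an array; the brief also asks for it to be visualised. A sketch using seaborn's heatmap, with toy labels as stand-ins for `y_test` and `prediction` from the cells above:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Toy stand-ins for y_test and prediction from the cells above
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["negative", "positive"],
            yticklabels=["negative", "positive"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.savefig("confusion_matrix.png")
```

In the notebook itself, passing the real `y_test` and `prediction` to `confusion_matrix` and calling `plt.show()` produces the inline plot.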